1 00:00:09,784 --> 00:00:11,867 - Okay let's get started. 2 00:00:13,038 --> 00:00:15,888 Alright, so welcome to lecture 14, 3 00:00:15,888 --> 00:00:20,884 and today we'll be talking about reinforcement learning. 4 00:00:20,884 --> 00:00:23,222 So some administrative details first, 5 00:00:23,222 --> 00:00:30,346 an update on grades. Midterm grades were released last night, so see Piazza for more information and statistics about that. 6 00:00:30,346 --> 00:00:35,402 And we also have A2 and milestone grades scheduled for later this week. 7 00:00:36,768 --> 00:00:40,682 Also, about your projects, all teams must register your projects. 8 00:00:40,682 --> 00:00:47,580 So on Piazza we have a form posted, so you should go there, and this is required, every team should go and fill out this form with information 9 00:00:47,580 --> 00:00:53,214 about your project, that we'll use for final grading and the poster session. 10 00:00:53,214 --> 00:01:01,779 And the Tiny ImageNet evaluation servers are also now online for those of you who are doing the Tiny ImageNet challenge. 11 00:01:01,779 --> 00:01:06,193 We also have a link to a course survey on Piazza that was released a few days ago, 12 00:01:06,193 --> 00:01:13,600 so, please fill it out if you guys haven't already. We'd love to have your feedback and know how we can improve this class. 13 00:01:16,589 --> 00:01:19,650 Okay, so the topic of today, reinforcement learning. 14 00:01:19,650 --> 00:01:22,544 Alright, so, so far we've talked about supervised learning, 15 00:01:22,544 --> 00:01:30,498 which is a type of problem where we have data x and we have labels y, and our goal is to learn a function mapping from x to y. 16 00:01:30,498 --> 00:01:35,067 So, for example, the classification problem that we've been working with. 17 00:01:35,067 --> 00:01:37,753 We also talked last lecture about unsupervised learning, 18 00:01:37,753 --> 00:01:45,362 which is the problem where we have just data and no labels, and our goal is to learn some underlying, hidden structure of the data. 19 00:01:45,362 --> 00:01:50,528 So, an example of this is the generative models that we talked about last lecture. 20 00:01:52,040 --> 00:01:57,370 And so today we're going to talk about a different kind of problem set-up, the reinforcement learning problem. 21 00:01:57,370 --> 00:02:01,824 And so here we have an agent that can take actions in its environment, 22 00:02:01,824 --> 00:02:04,352 and it can receive rewards for its actions. 23 00:02:04,352 --> 00:02:09,959 And its goal is going to be to learn how to take actions in a way that can maximize its reward. 24 00:02:09,959 --> 00:02:14,101 And so we'll talk about this in a lot more detail today. 25 00:02:14,101 --> 00:02:18,116 So, the outline for today, we're going to first talk about the reinforcement learning problem, 26 00:02:18,116 --> 00:02:20,927 and then we'll talk about Markov decision processes, 27 00:02:20,927 --> 00:02:24,747 which is a formalism of the reinforcement learning problem, 28 00:02:24,747 --> 00:02:31,095 and then we'll talk about two major classes of RL algorithms, Q-learning and policy gradients. 29 00:02:32,876 --> 00:02:38,936 So, in the reinforcement learning set-up, what we have is we have an agent and we have an environment. 30 00:02:38,936 --> 00:02:43,268 And so the environment gives the agent a state.
31 00:02:43,268 --> 00:02:46,877 In turn, the agent is going to take an action, 32 00:02:46,877 --> 00:02:52,609 and then the environment is going to give back a reward, as well as the next state. 33 00:02:52,609 --> 00:03:00,918 And so this is going to keep going on in this loop, on and on, until the environment gives back a terminal state, which then ends the episode. 34 00:03:00,918 --> 00:03:03,401 So, let's see some examples of this. 35 00:03:03,401 --> 00:03:05,536 First we have here the cart-pole problem, 36 00:03:05,536 --> 00:03:11,142 which is a classic problem that some of you may have seen, in, for example, 229 before. 37 00:03:11,142 --> 00:03:16,252 And so the objective here is that you want to balance a pole on top of a movable cart. 38 00:03:16,252 --> 00:03:20,280 Alright, so the state that you have here is your current description of the system. 39 00:03:20,280 --> 00:03:28,206 So, for example, the angle and angular speed of your pole, your position, and the horizontal velocity of your cart. 40 00:03:28,206 --> 00:03:33,224 And the actions you can take are horizontal forces that you apply onto the cart, right? 41 00:03:33,224 --> 00:03:38,387 So you're basically trying to move this cart around to try and balance this pole on top of it. 42 00:03:38,387 --> 00:03:43,990 And the reward that you're getting from this environment is one at each time step if your pole is upright. 43 00:03:43,990 --> 00:03:48,143 So you basically want to keep this pole balanced for as long as you can. 44 00:03:49,286 --> 00:03:52,192 Okay, so here's another example of a classic RL problem. 45 00:03:52,192 --> 00:03:53,998 Here is robot locomotion. 46 00:03:53,998 --> 00:03:59,670 So we have here an example of a humanoid robot, as well as an ant robot model. 47 00:03:59,670 --> 00:04:03,128 And our objective here is to make the robot move forward. 48 00:04:03,128 --> 00:04:10,807 And so the state that we have describing our system is the angles and positions of all the joints of our robots. 49 00:04:10,807 --> 00:04:15,887 And then the actions that we can take are the torques applied onto these joints, 50 00:04:15,887 --> 00:04:21,228 right, and so these are trying to make the robot move forward and then the reward that we get is 51 00:04:21,228 --> 00:04:31,701 our forward movement, as well as, in the case of the humanoid, something like a reward of one for each time step that this robot is upright. 52 00:04:33,521 --> 00:04:38,384 So, games are also a big class of problems that can be formulated with RL. 53 00:04:38,384 --> 00:04:40,700 So, for example, here we have Atari games, 54 00:04:40,700 --> 00:04:44,280 which are a classic success of deep reinforcement learning, 55 00:04:44,280 --> 00:04:48,574 and so here the objective is to complete these games with the highest possible score, right. 56 00:04:48,574 --> 00:04:52,753 So, your agent is basically a player that's trying to play these games. 57 00:04:52,753 --> 00:04:57,506 And the state that you have is going to be the raw pixels of the game state. 58 00:04:57,506 --> 00:05:02,882 Right, so these are just the pixels on the screen that you would see as you're playing the game. 59 00:05:02,882 --> 00:05:09,912 And then the actions that you have are your game controls, so for example, in some games maybe moving left to right, up or down.
60 00:05:09,912 --> 00:05:15,667 And then the reward that you have is your score increase or decrease at each time step, and your goal is going to be 61 00:05:15,667 --> 00:05:27,572 to maximize your total score over the course of the game. And, finally, here we have another example of a game. It's Go, which is 62 00:05:27,573 --> 00:05:31,697 something that was a huge achievement of deep reinforcement learning last year, 63 00:05:31,697 --> 00:05:38,588 when DeepMind's AlphaGo beat Lee Sedol, who is one of the best Go players of the last few years, 64 00:05:38,589 --> 00:05:50,918 and this is actually in the news again, as some of you may have seen, there's another Go competition going on now with AlphaGo versus a top-ranked Go player. 65 00:05:50,919 --> 00:05:56,295 And so the objective here is to win the game, and our state is the position 66 00:05:56,295 --> 00:06:03,911 of all the pieces, the action is where to put the next piece down, and the reward is one if you win at the end of the game, and zero otherwise. 67 00:06:03,912 --> 00:06:08,411 And we'll also talk about this one in a little bit more detail, later. 68 00:06:08,411 --> 00:06:13,329 Okay, so how can we mathematically formalize the RL problem, right? 69 00:06:13,330 --> 00:06:18,051 This loop that we talked about earlier, of environments giving agents states, 70 00:06:18,051 --> 00:06:20,634 and then agents taking actions. 71 00:06:22,394 --> 00:06:28,512 So, a Markov decision process is the mathematical formulation of the RL problem, 72 00:06:28,512 --> 00:06:31,447 and an MDP satisfies the Markov property, 73 00:06:31,447 --> 00:06:36,107 which is that the current state completely characterizes the state of the world. 74 00:06:36,107 --> 00:06:40,164 And an MDP here is defined by a tuple of objects, 75 00:06:40,164 --> 00:06:43,170 consisting of S, which is the set of possible states. 76 00:06:43,170 --> 00:06:45,762 We have A, our set of possible actions, 77 00:06:45,762 --> 00:06:51,694 we also have R, our distribution of reward, given a state, action pair, 78 00:06:51,694 --> 00:06:55,323 so it's a function mapping from a state, action pair to your reward. 79 00:06:55,323 --> 00:06:57,430 You also have P, which is a transition probability 80 00:06:57,430 --> 00:07:02,940 distribution over the next state that you're going to transition to, given your state, action pair. 81 00:07:02,940 --> 00:07:05,718 And then finally we have gamma, a discount factor, 82 00:07:05,718 --> 00:07:12,970 which is basically saying how much we value rewards coming up soon versus later on. 83 00:07:14,203 --> 00:07:17,395 So, the way the Markov Decision Process works is that 84 00:07:17,395 --> 00:07:20,053 at our initial time step t equals zero, 85 00:07:20,053 --> 00:07:26,362 the environment is going to sample some initial state s-zero, from the initial state distribution, p of s-zero. 86 00:07:26,363 --> 00:07:32,253 And then, once it has that, from time t equals zero until done, we're going to iterate through this loop 87 00:07:32,253 --> 00:07:35,797 where the agent is going to select an action, a sub t. 88 00:07:35,797 --> 00:07:38,885 The environment is going to sample a reward from here, 89 00:07:38,885 --> 00:07:44,032 so a reward given your state and the action that you just took.
90 00:07:44,032 --> 00:07:51,534 It's also going to sample the next state, at time t plus one, given your probability distribution, 91 00:07:51,534 --> 00:07:58,706 and then the agent is going to receive the reward, as well as the next state, and then we're going to go through this process again, 92 00:07:58,707 --> 00:08:05,542 and keep looping; the agent will select the next action, and so on until the episode is over. 93 00:08:05,542 --> 00:08:06,989 Okay, so 94 00:08:06,989 --> 00:08:10,724 now based on this, we can define a policy pi, 95 00:08:10,724 --> 00:08:16,651 which is a function from your states to your actions that specifies what action to take in each state. 96 00:08:16,651 --> 00:08:19,748 And this can be either deterministic or stochastic. 97 00:08:19,748 --> 00:08:27,204 And our objective now is going to be to find your optimal policy pi star, that maximizes your cumulative discounted reward. 98 00:08:27,205 --> 00:08:35,508 So we can see here we have the sum of our future rewards, which are also discounted by your discount factor. 99 00:08:35,509 --> 00:08:39,327 So, let's look at an example of a simple MDP. 100 00:08:39,327 --> 00:08:44,533 And here we have Grid World, which is this task where we have this grid of states. 101 00:08:44,533 --> 00:08:50,295 So you can be in any of these cells of your grid, which are your states. 102 00:08:50,295 --> 00:08:52,613 And you can take actions from your states, 103 00:08:52,613 --> 00:08:59,298 and so these actions are going to be simple movements, moving to your right, to your left, up or down. 104 00:08:59,299 --> 00:09:08,858 And you're going to get a negative reward for each transition, or each time step, basically, each movement that you take, 105 00:09:08,859 --> 00:09:11,989 and this can be something like R equals negative one. 106 00:09:11,989 --> 00:09:20,054 And so your objective is going to be to reach one of the terminal states, which are the gray states shown here, in the least number of actions. 107 00:09:20,055 --> 00:09:27,624 Right, so the longer that you take to reach your terminal state, the more you're going to keep accumulating these negative rewards. 108 00:09:27,625 --> 00:09:30,540 Okay, so if you look at a random policy here, 109 00:09:30,540 --> 00:09:39,089 a random policy would consist of, basically, at any given state or cell that you're in, just sampling randomly which direction you're going to move in next. 110 00:09:39,090 --> 00:09:41,843 Right, so all of these have equal probability. 111 00:09:41,843 --> 00:09:46,518 On the other hand, an optimal policy that we would like to have is 112 00:09:46,518 --> 00:09:51,866 basically taking the action, the direction, that will move us closest to a terminal state. 113 00:09:51,866 --> 00:09:59,170 So you can see here, if we're right next to one of the terminal states we should always move in the direction that gets us to this terminal state. 114 00:09:59,171 --> 00:10:09,118 And otherwise, if you're in one of these other states, you want to take the direction that will take you closest to one of these states. 115 00:10:09,119 --> 00:10:17,154 Okay, so now given this description of our MDP, what we want to do is we want to find our optimal policy pi star. 116 00:10:17,155 --> 00:10:20,755 Right, our policy that's maximizing the sum of the rewards.
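As a concrete illustration of the Grid World MDP just described, here is a minimal sketch in Python. The grid size, the terminal cells, the starting state, and the reward of negative one per step are assumptions chosen for the example, not the exact setup from the slide:

```python
import random

# Minimal Grid World sketch: states are cells, actions are moves,
# and each step costs -1 until a terminal state is reached.
SIZE = 4                                      # assumed 4x4 grid
TERMINALS = {(0, 0), (SIZE - 1, SIZE - 1)}    # assumed terminal cells
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Transition: returns (next_state, reward, done)."""
    dr, dc = ACTIONS[action]
    r = min(max(state[0] + dr, 0), SIZE - 1)  # moves off the grid keep you in place
    c = min(max(state[1] + dc, 0), SIZE - 1)
    next_state = (r, c)
    return next_state, -1.0, next_state in TERMINALS

def random_policy(state):
    """The random policy: pick each direction with equal probability."""
    return random.choice(list(ACTIONS))

# Roll out one episode under the random policy and accumulate the reward;
# the longer it wanders before hitting a terminal cell, the lower the return.
state, total_reward, done = (2, 1), 0.0, False
while not done:
    state, reward, done = step(state, random_policy(state))
    total_reward += reward
print("return:", total_reward)
```

An optimal policy would instead pick, in each cell, the move that heads toward the nearest terminal cell, which maximizes this (negative) return.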
117 00:10:20,755 --> 00:10:29,730 And so this optimal policy is going to tell us, given any state that we're in, what is the action that we should take in order to maximize the sum of the rewards that we'll get. 118 00:10:29,731 --> 00:10:34,091 And so one question is how do we handle the randomness in the MDP, right? 119 00:10:34,091 --> 00:10:39,073 We have randomness in terms of our initial state that we're sampling, 120 00:10:39,073 --> 00:10:46,340 in terms of this transition probability distribution that will give us the distribution of our next states, and so on. 121 00:10:46,341 --> 00:10:51,947 And so what we'll do is we'll work, then, with maximizing our expected sum of the rewards. 122 00:10:51,947 --> 00:11:02,956 So, formally, we can write our optimal policy pi star as maximizing this expected sum of future rewards over policies pi, where we have our initial state 123 00:11:02,957 --> 00:11:05,103 sampled from our state distribution. 124 00:11:05,103 --> 00:11:09,127 We have our actions, sampled from our policy, given the state. 125 00:11:09,127 --> 00:11:16,423 And then we have our next states sampled from our transition probability distributions. 126 00:11:16,423 --> 00:11:22,142 Okay, so before we talk about exactly how we're going to find this policy, 127 00:11:22,143 --> 00:11:26,787 let's first talk about a few definitions that are going to be helpful for us in doing so. 128 00:11:26,787 --> 00:11:31,405 So, specifically, the value function and the Q-value function. 129 00:11:31,405 --> 00:11:37,425 So, as we follow the policy, we're going to sample trajectories or paths, right, for every episode. 130 00:11:37,426 --> 00:11:43,611 And we're going to have our initial state s-zero, a-zero, r-zero, s-one, a-one, r-one, and so on. 131 00:11:43,611 --> 00:11:49,331 We're going to have this trajectory of states, actions, and rewards that we get. 132 00:11:49,331 --> 00:11:52,613 And so, how good is a state that we're currently in? 133 00:11:52,613 --> 00:12:10,799 Well, the value function at any state s is the expected cumulative reward following the policy from state s, from here on out. Right, so it's going to be the expected value of our cumulative reward, starting from our current state. 134 00:12:10,800 --> 00:12:13,286 And then how good is a state, action pair? 135 00:12:13,286 --> 00:12:17,370 So how good is taking action a in state s? 136 00:12:17,370 --> 00:12:20,468 And we define this using a Q-value function, 137 00:12:20,468 --> 00:12:27,741 which is the expected cumulative reward from taking action a in state s and then following the policy. 138 00:12:29,708 --> 00:12:45,098 Right, so then, the optimal Q-value function that we can get is going to be Q star, which is the maximum expected cumulative reward that we can get from a given state, action pair, defined here. 139 00:12:45,099 --> 00:12:52,017 So now we're going to see one important thing in reinforcement learning, which is called the Bellman equation. 140 00:12:52,018 --> 00:12:57,697 So let's consider the Q-value function from the optimal policy, Q star, 141 00:12:57,697 --> 00:13:00,911 which is then going to satisfy this Bellman equation, 142 00:13:00,911 --> 00:13:05,194 which is this identity shown here, and what this means is that 143 00:13:05,194 --> 00:13:11,748 given any state, action pair, s and a, the value of this pair is going to be the reward 144 00:13:11,748 --> 00:13:18,867 that you're going to get, r, plus the value of whatever state that you end up in. So, let's say, s prime.
145 00:13:18,868 --> 00:13:24,150 And since we know that we have the optimal policy, then we also know that we're going to 146 00:13:24,150 --> 00:13:28,746 play the best action that we can, right, at our state s prime. 147 00:13:28,746 --> 00:13:34,432 And so then, the value at state s prime is just going to be the maximum over our actions, 148 00:13:34,432 --> 00:13:38,626 a prime, of Q star at s prime, a prime. 149 00:13:38,626 --> 00:13:44,119 And so then we get this identity here, for the optimal Q-value. 150 00:13:44,119 --> 00:13:54,251 Right, and then also, as always, we have this expectation here, because we have randomness over what state we're going to end up in. 151 00:13:54,252 --> 00:14:02,487 And then we can also infer, from here, that our optimal policy, right, is going to consist of taking the best action in any state, as specified by Q star. 152 00:14:02,488 --> 00:14:08,436 Q star is going to tell us the maximum future reward that we can get from any of our actions, 153 00:14:08,437 --> 00:14:16,862 so we should just take a policy that's following this and just taking the action that's going to lead to the best reward. 154 00:14:16,863 --> 00:14:21,025 Okay, so how can we solve for this optimal policy? 155 00:14:21,025 --> 00:14:25,692 So, one way we can solve for this is something called a value iteration algorithm, 156 00:14:25,692 --> 00:14:29,527 where we're going to use this Bellman equation as an iterative update. 157 00:14:29,527 --> 00:14:37,997 So at each step, we're going to refine our approximation of Q star by trying to enforce the Bellman equation. 158 00:14:39,347 --> 00:14:50,830 And so, under some mathematical conditions, we also know that this sequence Q_i of our Q-functions is going to converge to our optimal Q star as i approaches infinity. 159 00:14:54,257 --> 00:14:58,329 And so this, this works well, but what's the problem with this? 160 00:14:59,184 --> 00:15:01,887 Well, an important problem is that this is not scalable. 161 00:15:01,887 --> 00:15:02,720 Right? 162 00:15:02,720 --> 00:15:08,596 We have to compute Q of s, a here for every state, action pair in order to make our iterative updates. 163 00:15:08,597 --> 00:15:18,932 Right, but then this is a problem if, for example, we look at the state of, for example, an Atari game that we had earlier; it's going to be your screen of pixels. 164 00:15:18,933 --> 00:15:28,724 And this is a huge state space, and it's basically computationally infeasible to compute this for the entire state space. 165 00:15:28,725 --> 00:15:35,907 Okay, so what's the solution to this? Well, we can use a function approximator to estimate Q of s, a, 166 00:15:35,908 --> 00:15:37,620 so, for example, a neural network, right. 167 00:15:37,620 --> 00:15:48,471 So, we've seen before that any time we have some really complex function that we don't know, that we want to estimate, a neural network is a good way to estimate this. 168 00:15:48,472 --> 00:15:54,242 Okay, so this is going to take us to our formulation of Q-learning that we're going to look at. 169 00:15:54,242 --> 00:16:02,950 And so, what we're going to do is we're going to use a function approximator in order to estimate our action-value function. 170 00:16:02,118 --> 00:16:02,951 Right? 171 00:16:02,951 --> 00:16:08,141 And if this function approximator is a deep neural network, which is what's been used recently, 172 00:16:08,142 --> 00:16:10,782 then this is going to be called deep Q-learning.
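To make this concrete, here is a rough sketch of a Q-learning update with a neural-network function approximator, written in PyTorch. The tiny fully-connected network, the Adam optimizer, and the hyperparameter values are illustrative assumptions; the actual Atari network described next is convolutional:

```python
import torch
import torch.nn as nn

gamma = 0.99          # assumed discount factor
num_actions = 4       # e.g. four game actions
state_dim = 16        # assumed flattened state size for this toy sketch

# Q(s, a; theta): one forward pass outputs a Q-value for every action.
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, num_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def q_learning_step(states, actions, rewards, next_states, dones):
    """One gradient step on the Bellman error:
       y_i = r + gamma * max_a' Q(s', a'; theta), loss = (Q(s, a; theta) - y_i)^2."""
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():   # treat the target y_i as fixed for this step
        y = rewards + gamma * (1 - dones) * q_net(next_states).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, y)   # forward pass: Bellman error
    optimizer.zero_grad()
    loss.backward()                          # backward pass: d(loss)/d(theta)
    optimizer.step()                         # gradient step on theta
    return loss.item()
```

Iterating this update is what pushes Q(s, a; theta) toward satisfying the Bellman equation.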
173 00:16:10,782 --> 00:16:20,149 And so this is something that you'll hear around as one of the common approaches to deep reinforcement learning that's in use. 174 00:16:20,150 --> 00:16:30,765 Right, and so in this case, we also have our function parameters theta here, so our Q-value function is determined by these weights, theta, of our neural network. 175 00:16:33,050 --> 00:16:37,970 Okay, so given this function approximation, how do we solve for our optimal policy? 176 00:16:37,970 --> 00:16:44,744 So remember that we want to find a Q-function that's satisfying the Bellman equation. 177 00:16:44,744 --> 00:16:54,712 Right, and so we want to enforce this Bellman equation to happen, so what we can do when we have this neural network approximating our Q-function is that 178 00:16:54,713 --> 00:17:00,239 we can train this where our loss function is going to try and minimize the error of our Bellman equation, right? 179 00:17:00,240 --> 00:17:03,689 Or how far Q of s, a is from its target, 180 00:17:03,689 --> 00:17:09,853 which is the Y_i here, the right-hand side of the Bellman equation that we saw earlier. 181 00:17:09,853 --> 00:17:16,928 So, we're basically going to take these forward passes of our loss function, trying to minimize this error, 182 00:17:16,929 --> 00:17:28,182 and then our backward pass, our gradient update, is just going to be, you just take the gradient of this loss with respect to our network parameters theta. 183 00:17:28,183 --> 00:17:38,435 Right, and so our goal is again to have this effect as we're taking gradient steps of iteratively trying to make our Q-function closer to our target value. 184 00:17:38,436 --> 00:17:43,524 So, any questions about this? Okay. 185 00:17:44,537 --> 00:17:53,369 So let's look at a case study, one of the classic examples of deep reinforcement learning where this approach was applied. 186 00:17:53,370 --> 00:18:01,745 And so we're going to look at this problem that we saw earlier of playing Atari games, where our objective was to complete the game with the highest score, 187 00:18:01,746 --> 00:18:05,460 and remember our state is going to be the raw pixel inputs of the game state, 188 00:18:05,460 --> 00:18:12,963 and we can take these actions of moving left, right, up, down, or whatever actions of the particular game. 189 00:18:12,964 --> 00:18:21,182 And our reward at each time step, we're going to get a reward of our score increase or decrease that we got at this time step, and so our cumulative total 190 00:18:21,183 --> 00:18:27,095 reward is this total reward that we'll usually see at the top of the screen. 191 00:18:27,095 --> 00:18:32,955 Okay, so the network that we're going to use for our Q-function is going to look something like this, 192 00:18:32,955 --> 00:18:37,355 right, where we have our Q-network, with weights theta. 193 00:18:37,355 --> 00:18:43,791 And then our input, our state s, is going to be our current game screen. 194 00:18:43,791 --> 00:18:49,509 And in practice we're going to take a stack of the last four frames, so we have some history. 195 00:18:49,509 --> 00:18:58,608 And so we'll take these raw pixel values, we'll do some, you know, RGB to gray-scale conversions, some down-sampling, some cropping, so, some pre-processing. 196 00:18:58,609 --> 00:19:04,631 And what we'll get out of this is this 84 by 84 by four stack of the last four frames. 197 00:19:04,631 --> 00:19:05,464 Yeah, question.
198 00:19:05,464 --> 00:19:09,631 [inaudible question from audience] 199 00:19:12,792 --> 00:19:20,808 Okay, so the question is, are we saying here that our network is going to approximate our Q-value function for different state, action pairs, 200 00:19:20,809 --> 00:19:24,765 for example, four of these? Yeah, that's correct. 201 00:19:24,765 --> 00:19:27,551 We'll see, we'll talk about that in a few slides. 202 00:19:27,551 --> 00:19:29,935 [inaudible question from audience] 203 00:19:29,935 --> 00:19:36,815 So, no. So, we don't have a Softmax layer after the fully-connected layer, because here our goal is to directly predict our Q-values. 204 00:19:36,816 --> 00:19:40,582 [inaudible question from audience] Q-values. [inaudible question from audience] 205 00:19:40,583 --> 00:19:44,014 Yes, so it's more doing regression to our Q-values. 206 00:19:44,014 --> 00:19:54,083 Okay, so we have our input to this network and then on top of this, we're going to have a couple of familiar convolutional layers, and a fully-connected layer, 207 00:19:54,084 --> 00:19:59,610 so here we have some eight-by-eight convolutions and we have some four-by-four convolutions. 208 00:19:59,611 --> 00:20:05,673 Then we have an FC-256 layer, so this is just a standard kind of network that you've seen before. 209 00:20:05,674 --> 00:20:17,414 And then, finally, our last fully-connected layer has a vector of outputs, which is corresponding to your Q-value for each action, right, given the state that you've input. 210 00:20:17,415 --> 00:20:21,770 And so, for example, if you have four actions, then here we have this four-dimensional output 211 00:20:21,770 --> 00:20:28,685 corresponding to Q of the current s, as well as a-one, and then a-two, a-three, and a-four. 212 00:20:28,685 --> 00:20:33,179 Right, so this is going to be one scalar value for each of our actions. 213 00:20:33,179 --> 00:20:43,072 And then the number of actions that we have can vary from, for example, 4 to 18, depending on the Atari game. 214 00:20:43,073 --> 00:20:46,709 And one nice thing here is that using this network structure, 215 00:20:46,709 --> 00:20:54,650 a single feedforward pass is able to compute the Q-values for all actions from the current state. 216 00:20:54,651 --> 00:20:56,117 And so this is really efficient. 217 00:20:56,117 --> 00:21:05,945 Right, so basically we take our current state in, and then because we have this output of a Q-value for each action as our output layer, 218 00:21:05,946 --> 00:21:10,259 we're able to do one pass and get all of these values out. 219 00:21:10,259 --> 00:21:15,078 And then in order to train this, we're just going to use our loss function from before. 220 00:21:15,078 --> 00:21:17,661 Remember, we're trying to enforce this Bellman equation, 221 00:21:17,661 --> 00:21:29,314 and so, on our forward pass, with our loss function we're going to try and iteratively make our Q-value close to the target value that it should have. 222 00:21:29,315 --> 00:21:40,705 And then our backward pass is just directly taking the gradient of this loss function that we have and then taking a gradient step based on that. 223 00:21:40,706 --> 00:21:45,639 So one other thing that's used here that I want to mention is something called experience replay. 224 00:21:45,639 --> 00:21:53,440 And so this addresses a problem with just using the plain Q-network that I just described, 225 00:21:53,440 --> 00:21:58,134 which is that learning from batches of consecutive samples is bad.
226 00:21:58,134 --> 00:22:09,409 And so the reason for this, right, is if we're just playing the game, taking samples of state, action, rewards that we have and just taking consecutive samples of these and training with these, 227 00:22:09,410 --> 00:22:16,117 well, all of these samples are correlated and so this leads to inefficient learning, first of all, 228 00:22:16,118 --> 00:22:24,841 and also, because of this, our current Q-network parameters, right, this determines the policy that we're going to follow, it determines our next 229 00:22:24,842 --> 00:22:27,394 samples that we're going to get that we're going to use for training. 230 00:22:27,394 --> 00:22:30,832 And so this leads to problems where you can have bad feedback loops. 231 00:22:30,832 --> 00:22:35,468 So, for example, if currently the maximizing action is to move left, 232 00:22:35,468 --> 00:22:43,305 well, this is going to bias all of my upcoming training examples to be dominated by samples from the left-hand side. 233 00:22:43,306 --> 00:22:45,406 And so this is a problem, right? 234 00:22:45,406 --> 00:22:53,097 And so the way that we are going to address these problems is by using something called experience replay, where we're going to keep this 235 00:22:53,098 --> 00:23:01,352 replay memory table of transitions, of state, action, reward, next state transitions that we have, and we're going 236 00:23:01,353 --> 00:23:08,772 to continuously update this table with new transitions that we're getting as game episodes are played, as we're getting more experience. 237 00:23:08,773 --> 00:23:16,334 Right, and so now what we can do is that we can now train our Q-network on random mini-batches of transitions from the replay memory. 238 00:23:16,335 --> 00:23:24,826 Right, so instead of using consecutive samples, we're now going to sample across these transitions that we've accumulated, taking random samples of these, 239 00:23:24,827 --> 00:23:31,007 and this breaks all of these correlation problems that we had earlier. 240 00:23:31,007 --> 00:23:39,206 And then also, another side benefit is that each of these transitions can also contribute to potentially multiple weight updates. 241 00:23:39,207 --> 00:23:43,652 We're just sampling from this table and so we could sample one multiple times. 242 00:23:43,652 --> 00:23:47,585 And so, this is going to lead also to greater data efficiency. 243 00:23:50,580 --> 00:23:59,165 Okay, so let's put this all together and let's look at the full algorithm for deep Q-learning with experience replay. 244 00:23:59,166 --> 00:24:03,940 So we're going to start off by initializing our replay memory 245 00:24:03,940 --> 00:24:14,829 to some capacity that we choose, N, and then we're also going to initialize our Q-network, just with our random weights or initial weights. 246 00:24:14,830 --> 00:24:21,832 And then we're going to play M episodes, or full games. These are going to be our training episodes. 247 00:24:21,832 --> 00:24:31,264 And then what we're going to do is we're going to initialize our state, using the starting game screen pixels at the beginning of each episode. 248 00:24:31,265 --> 00:24:37,814 And remember, we go through the pre-processing step to get to our actual input state.
249 00:24:37,814 --> 00:24:41,584 And then for each time step of a game that we're currently playing, 250 00:24:41,584 --> 00:24:46,268 we're going to, with a small probability, select a random action, 251 00:24:46,268 --> 00:24:53,141 so one thing that's important in these algorithms is to have sufficient exploration, 252 00:24:53,141 --> 00:24:58,559 so we want to make sure that we are sampling different parts of the state space. 253 00:24:58,559 --> 00:25:03,613 And then otherwise, we're going to select the greedy action from the current policy. 254 00:25:03,614 --> 00:25:13,579 Right, so most of the time we'll take the greedy action that we think is a good policy of the type of actions that we want to take and states that we want to see, and with a small probability 255 00:25:13,580 --> 00:25:16,300 we'll sample something random. 256 00:25:16,300 --> 00:25:26,069 Okay, so then we'll take this action, a_t, and we'll observe the next reward and the next state. So r_t and s_t plus one. 257 00:25:26,070 --> 00:25:32,771 And then we'll take this and we'll store this transition in our replay memory that we're building up. 258 00:25:32,771 --> 00:25:35,577 And then we're going to train the network a little bit. 259 00:25:35,577 --> 00:25:37,550 So we're going to do experience replay, 260 00:25:37,550 --> 00:25:47,213 and we'll take a sample of random mini-batches of transitions that we have from the replay memory, and then we'll perform a gradient descent step on this. 261 00:25:47,214 --> 00:25:49,635 Right, so this is going to be our full training loop. 262 00:25:49,635 --> 00:26:03,886 We're going to be continuously playing this game and then also sampling minibatches, using experience replay to update the weights of our Q-network, and then continuing in this fashion. 263 00:26:03,887 --> 00:26:20,910 Okay, so let's see. Let's see if I can, is this playing? Okay, so let's take a look at this deep Q-learning algorithm from Google DeepMind, trained on an Atari game of Breakout. 264 00:26:20,911 --> 00:26:26,185 Alright, so it's saying here that our input is just going to be our state, the raw game pixels. 265 00:26:26,185 --> 00:26:29,520 And so here we're looking at what's happening at the beginning of training. 266 00:26:29,520 --> 00:26:40,302 So we've just started training a bit. And right, so it looks like it's learned to kind of hit the ball, but it's not doing a very good job of sustaining it. 267 00:26:40,303 --> 00:26:42,886 But it is looking for the ball. 268 00:26:50,969 --> 00:26:55,737 Okay, so now after some more training, it looks like a couple hours. 269 00:27:00,946 --> 00:27:05,113 Okay, so now it's learning to do a pretty good job here. 270 00:27:06,190 --> 00:27:16,592 So it's able to continuously follow this ball and be able to remove most of the blocks. 271 00:27:16,593 --> 00:27:36,203 Right, so after 240 minutes. Okay, so here it's found the pro strategy, right? 272 00:27:36,203 --> 00:27:42,795 You want to get all the way to the top and then have it go by itself. Okay, so 273 00:27:42,796 --> 00:27:49,500 this is an example of using deep Q-learning in order to train an agent to be able to play Atari games. 274 00:27:49,501 --> 00:27:56,418 It's able to do this on many Atari games and so you can check out some more of this online. 275 00:27:56,419 --> 00:28:01,149 Okay, so we've talked about Q-learning. But there is a problem with Q-learning, right? 276 00:28:01,149 --> 00:28:07,225 It can be challenging, and what's the problem?
Well, the problem can be that the Q-function is very complicated. 277 00:28:07,226 --> 00:28:12,335 Right, so we're saying that we want to learn the value of every state, action pair. 278 00:28:12,335 --> 00:28:17,275 So, if, let's say you have something, for example, a robot grasping, wanting to grasp an object. 279 00:28:17,275 --> 00:28:19,576 Right, you're going to have a really high dimensional state. 280 00:28:19,576 --> 00:28:26,225 You have, I mean, let's say you have all of your joint positions and angles. 281 00:28:26,225 --> 00:28:35,492 Right, and so learning the exact value of every state, action pair that you have, right, can be really, really hard to do. 282 00:28:35,493 --> 00:28:38,724 But on the other hand, your policy can be much simpler. 283 00:28:38,724 --> 00:28:44,555 Right, like what you want this robot to do may be just this simple motion of closing your hand, right? 284 00:28:44,556 --> 00:28:48,252 Just, move your fingers in this particular direction and keep going. 285 00:28:48,252 --> 00:28:54,142 And so, that leads to the question of can we just learn this policy directly? 286 00:28:54,142 --> 00:28:58,306 Right, is it possible, maybe, to just find the best policy from a collection of policies, 287 00:28:58,306 --> 00:29:06,789 without trying to go through this process of estimating your Q-value and then using that to infer your policy? 288 00:29:06,790 --> 00:29:15,937 So, okay, this is an approach that we're going to call policy gradients. 289 00:29:15,938 --> 00:29:20,858 And so, formally, let's define a class of parametrized policies, 290 00:29:20,858 --> 00:29:24,146 parametrized by weights theta, 291 00:29:24,146 --> 00:29:27,791 and so for each policy let's define the value of the policy. 292 00:29:27,791 --> 00:29:35,722 So, J, our value J, given parameters theta, is going to be our expected cumulative sum of future rewards that we care about. 293 00:29:35,723 --> 00:29:38,971 So, the same reward that we've been using. 294 00:29:38,971 --> 00:29:41,879 And so our goal then, under this setup, 295 00:29:41,879 --> 00:29:51,547 is that we want to find an optimal policy, theta star, which is the arg max over theta of J of theta. 296 00:29:51,548 --> 00:29:56,917 So we want to find the policy, the policy parameters, that give our best expected reward. 297 00:29:56,917 --> 00:30:01,011 So, how can we do this? Any ideas? 298 00:30:04,993 --> 00:30:10,155 Okay, well, what we can do is just gradient ascent on our policy parameters, right? 299 00:30:10,155 --> 00:30:15,460 We've learned that given some objective that we have and some parameters, we can just use gradient ascent 300 00:30:15,460 --> 00:30:20,762 in order to continuously improve our parameters. 301 00:30:23,202 --> 00:30:29,196 And so let's talk more specifically about how we can do this, which we're going to call here the REINFORCE algorithm. 302 00:30:29,196 --> 00:30:36,781 So, mathematically, we can write out our expected future reward over trajectories, and so we're going to sample 303 00:30:36,781 --> 00:30:41,902 these trajectories of experience, right, like for example episodes of game play that we talked about earlier. 304 00:30:41,902 --> 00:30:47,411 S-zero, a-zero, r-zero, s-one, a-one, r-one, and so on. 305 00:30:47,411 --> 00:30:51,723 Using some policy pi of theta.
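Written out, the policy value and the objective just described are:

$$ J(\theta) = \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \mid \pi_\theta\right], \qquad \theta^* = \arg\max_\theta J(\theta), $$

and REINFORCE will estimate the gradient of this expectation from trajectories sampled under $\pi_\theta$.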
306 00:30:51,723 --> 00:30:57,739 Right, and then so, for each trajectory we can compute a reward for that trajectory. 307 00:30:57,739 --> 00:31:01,245 It's the cumulative reward that we got from following this trajectory. 308 00:31:01,245 --> 00:31:10,570 And then the value of a policy, pi sub theta, is going to be just the expected reward of these trajectories that we can get from following pi sub theta. 309 00:31:10,570 --> 00:31:16,868 So that's here, this expectation over trajectories that we can get, sampling trajectories from our policy. 310 00:31:18,563 --> 00:31:21,288 Okay. So, we want to do gradient ascent, right? 311 00:31:21,288 --> 00:31:22,961 So let's differentiate this. 312 00:31:22,961 --> 00:31:27,356 Once we differentiate this, then we can just take gradient steps, like normal. 313 00:31:28,535 --> 00:31:34,300 So, the problem is that now if we try and just differentiate this exactly, this is intractable, right? 314 00:31:34,300 --> 00:31:41,319 So, the gradient of an expectation is problematic when p is dependent on theta here, because here 315 00:31:41,319 --> 00:31:47,661 we want to take this gradient of p of tau, given theta, 316 00:31:47,661 --> 00:31:53,033 but we want to take this integral over tau. Right, so this is intractable. 317 00:31:53,033 --> 00:31:57,327 However, we can use a trick here to get around this. 318 00:31:57,327 --> 00:32:04,941 And this trick is taking this gradient that we want, of p. We can rewrite this by just multiplying this by one, 319 00:32:04,941 --> 00:32:10,286 by multiplying top and bottom, both by p of tau given theta. 320 00:32:10,286 --> 00:32:26,170 Right, and then if we look at these terms that we have now here, in the way that I've written this, on the left and the right, this is actually going to be equivalent to p of tau times our gradient with respect to theta of log of p. 321 00:32:26,170 --> 00:32:32,741 Right, because the gradient of the log of p is just going to be one over p times the gradient of p. 322 00:32:33,808 --> 00:32:41,385 Okay, so if we then inject this back into our expression that we had earlier for this gradient, 323 00:32:41,385 --> 00:32:43,426 we can see what this will actually look like, 324 00:32:43,426 --> 00:32:52,187 right, because now we have a gradient of log p times our probabilities of all of these trajectories and then taking this integral here, over tau. 325 00:32:52,187 --> 00:32:58,586 This is now going to be an expectation over our trajectories tau, and so what we've done here 326 00:32:58,586 --> 00:33:02,751 is that we've taken a gradient of an expectation 327 00:33:02,751 --> 00:33:06,823 and we've transformed it into an expectation of gradients. 328 00:33:06,823 --> 00:33:13,817 Right, and so now we can use sample trajectories that we can get in order to estimate our gradient. 329 00:33:14,712 --> 00:33:21,260 And so we do this using Monte Carlo sampling, and this is one of the core ideas of REINFORCE. 330 00:33:23,624 --> 00:33:28,180 Okay, so looking at this expression that we want to compute, 331 00:33:28,180 --> 00:33:33,071 can we compute these quantities that we had here without knowing the transition probabilities? 332 00:33:33,071 --> 00:33:38,466 Alright, so we have that p of tau is going to be the probability of a trajectory.
333 00:33:38,466 --> 00:33:45,821 It's going to be the product of all of our transition probabilities of the next state that we get, given our current state and action, 334 00:33:45,821 --> 00:33:52,232 as well as our probability of the actions that we've taken under our policy pi. 335 00:33:52,232 --> 00:33:58,441 Right, so we're going to multiply all of these together, and get our probability of our trajectory. 336 00:33:58,441 --> 00:34:10,389 So this log of p of tau that we want to compute is going to be, we just take this log and this will separate out into a sum, pushing the logs inside. 337 00:34:10,389 --> 00:34:18,161 And then here, when we differentiate this, we can see we want to differentiate with respect to theta, but this first term that we have here, 338 00:34:18,163 --> 00:34:22,850 log p of the state transition probabilities, there's no theta term here, and so 339 00:34:22,850 --> 00:34:28,709 the only place where we have theta is the second term that we have, of log of pi sub theta, 340 00:34:29,675 --> 00:34:39,669 of our action, given our state, and so this is the only term that we keep in our gradient estimate, and so we can see here that this doesn't depend on the transition probabilities, 341 00:34:39,670 --> 00:34:46,421 right, so we actually don't need to know our transition probabilities in order to compute our gradient estimate. 342 00:34:47,257 --> 00:34:57,264 And then, so, therefore when we're sampling these, for any given trajectory tau, we can estimate J of theta using this gradient estimate. 343 00:34:58,524 --> 00:35:02,220 This is here shown for a single trajectory from what we had earlier, 344 00:35:02,220 --> 00:35:07,188 and then we can also sample over multiple trajectories to get the expectation. 345 00:35:09,248 --> 00:35:17,141 Okay, so given this gradient estimator that we've derived, the interpretation that we can make from this here, is that 346 00:35:18,217 --> 00:35:25,226 if our reward for a trajectory is high, if the reward that we got from taking the sequence of actions was good, 347 00:35:25,226 --> 00:35:29,434 then let's push up the probabilities of all the actions that we've seen. 348 00:35:29,434 --> 00:35:33,141 Right, we're just going to say that these were good actions that we took. 349 00:35:33,141 --> 00:35:37,186 And then if the reward is low, we want to push down these probabilities. 350 00:35:37,186 --> 00:35:40,747 We want to say these were bad actions, let's try and not sample these so much. 351 00:35:40,747 --> 00:35:50,980 Right, and so we can see that's what's happening here, where we have pi of a, given s. This is the likelihood of the actions that we've taken, 352 00:35:50,980 --> 00:36:03,575 and then we're going to scale this, we're going to take the gradient, and the gradient is going to tell us how much we should change the parameters in order to increase our likelihood of our action, a, right? 353 00:36:03,575 --> 00:36:12,602 And then we're going to take this and scale it by how much reward we actually got from it, so how good were these actions, in reality. 354 00:36:14,561 --> 00:36:23,798 Okay, so this might seem simplistic to say that, you know, if a trajectory is good, then we're saying here that all of its actions were good. Right? 355 00:36:23,798 --> 00:36:26,356 But, in expectation, this actually averages out.
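As a minimal sketch of turning this estimator into an actual gradient step (the small policy network, the optimizer, and the tensor shapes are assumptions for illustration), the core of REINFORCE for one sampled trajectory looks like this in PyTorch:

```python
import torch
import torch.nn as nn

# A stand-in policy network pi_theta(a | s): state in, action logits out.
policy = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reinforce_step(states, actions, rewards):
    """One update from a single sampled trajectory:
       grad J(theta) ~ sum_t r(tau) * grad_theta log pi_theta(a_t | s_t)."""
    log_probs = torch.log_softmax(policy(states), dim=1)
    log_pi = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)  # log pi(a_t | s_t)
    r_tau = rewards.sum()                    # total reward of the trajectory
    loss = -(r_tau * log_pi).sum()           # minimizing this does gradient ascent on J
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Note that every action in the trajectory gets scaled by the same total reward r(tau); the variance-reduction ideas discussed next (reward-to-go, discounting, baselines) only change what multiplies each log-probability term.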
356 00:36:26,356 --> 00:36:30,125 So we have an unbiased estimator here, 357 00:36:30,125 --> 00:36:35,622 and so if you have many samples of this, then we will get an accurate estimate of our gradient. 358 00:36:35,622 --> 00:36:42,994 And this is nice because we can just take gradient steps and we know that we're going to be improving our objective and getting closer 359 00:36:42,994 --> 00:36:48,602 to, at least, some local optimum of our policy parameters theta. 360 00:36:48,602 --> 00:36:54,884 Alright, but there is a problem with this, and the problem is that this also suffers from high variance, 361 00:36:54,884 --> 00:36:57,201 because this credit assignment is really hard. 362 00:36:57,201 --> 00:37:04,412 Right, we're saying that given a reward that we got, we're going to say all of the actions were good, and we're just going to hope 363 00:37:04,412 --> 00:37:11,080 that this assignment of which actions were actually the best actions, that mattered, is going to average out over time. 364 00:37:11,080 --> 00:37:17,190 And so this is really hard and we need a lot of samples in order to have a good estimate. 365 00:37:17,190 --> 00:37:23,851 Alright, so this leads to the question of, is there anything that we can do to reduce the variance and improve the estimator? 366 00:37:26,540 --> 00:37:33,323 And so variance reduction is an important area of research in policy gradients, 367 00:37:33,323 --> 00:37:39,756 and in coming up with ways in order to improve the estimator and require fewer samples. 368 00:37:39,756 --> 00:37:43,278 Alright, so let's look at a couple of ideas of how we can do this. 369 00:37:44,202 --> 00:37:46,764 So given our gradient estimator, 370 00:37:46,764 --> 00:37:57,091 so the first idea is that we can push up the probabilities of an action only by its effect on future rewards from that state, right? 371 00:37:57,091 --> 00:38:04,736 So now, instead of scaling this likelihood, or pushing up this likelihood of this action, by the total reward of its trajectory, 372 00:38:04,736 --> 00:38:12,108 let's look more specifically at just the sum of rewards coming from this time step on to the end, right? 373 00:38:12,108 --> 00:38:18,999 And so, this is basically saying that how good an action is, is only specified by how much future reward it generates. 374 00:38:18,999 --> 00:38:20,499 Which makes sense. 375 00:38:21,811 --> 00:38:29,448 Okay, so a second idea that we can also use is using a discount factor in order to ignore delayed effects. 376 00:38:29,448 --> 00:38:36,774 Alright, so here we've added back in this discount factor, that we've seen before, which is saying that 377 00:38:36,774 --> 00:38:47,276 you know, our discount factor is going to tell us how much we care about just the rewards that are coming up soon, versus rewards that came much later on. 378 00:38:47,276 --> 00:38:57,489 Right, so we're going to now say how good or bad an action is, looking more at the local neighborhood of actions it generates in the immediate near future, 379 00:38:57,489 --> 00:39:00,880 and down-weighting the ones that come later on. 380 00:39:00,880 --> 00:39:07,730 Okay, so these are some straightforward ideas that are generally used in practice. 381 00:39:07,730 --> 00:39:14,597 So, a third idea is this idea of using a baseline in order to reduce your variance. 382 00:39:14,597 --> 00:39:20,690 And so, a problem with just using the raw value of your trajectories is that 383 00:39:21,675 --> 00:39:23,869 this isn't necessarily meaningful, right?
384 00:39:23,869 --> 00:39:29,835 So, for example, if your rewards are all positive, then you're just going to keep pushing up the probabilities of all your actions. 385 00:39:29,835 --> 00:39:32,039 And of course, you'll push them up to various degrees, 386 00:39:32,039 --> 00:39:39,753 but what's really important is whether a reward is better or worse than what you're expecting to be getting. 387 00:39:39,753 --> 00:39:46,071 Alright, so in order to address this, we can introduce a baseline function that's dependent on the state. 388 00:39:46,071 --> 00:39:53,886 Right, so this baseline function tells us our guess of what we expect to get from this state, and then 389 00:39:55,515 --> 00:39:59,837 our reward, or our scaling factor that we're going to use to be pushing up or down our probabilities, 390 00:39:59,837 --> 00:40:05,508 can now just be our expected sum of future rewards minus this baseline, so now it's relative: how 391 00:40:05,508 --> 00:40:10,772 much better or worse is the reward that we got than what we expected. 392 00:40:11,870 --> 00:40:16,099 And so how can we choose this baseline? Well, 393 00:40:16,099 --> 00:40:19,168 a very simple baseline, the most simple you can use, 394 00:40:19,168 --> 00:40:23,013 is just taking a moving average of rewards that you've experienced so far. 395 00:40:23,013 --> 00:40:34,765 So you can even do this over all trajectories, and this is just an average of what rewards have I been seeing as I've been training, and as I've been playing these episodes? 396 00:40:34,765 --> 00:40:41,716 Right, and so this gives some idea of whether the reward that I currently get was relatively better or worse. 397 00:40:42,821 --> 00:40:54,452 And so there are some variants of this that you can use, but the variance reduction techniques that we've seen so far are all used in what's typically called the "vanilla REINFORCE" algorithm. 398 00:40:54,452 --> 00:41:00,954 Right, so looking at the cumulative future reward, having a discount factor, and some simple baselines. 399 00:41:02,601 --> 00:41:08,769 Now let's talk about how we can think about this idea of a baseline and potentially choose better baselines. 400 00:41:08,769 --> 00:41:13,567 Right, so if we're going to think about what's a better baseline that we can choose, 401 00:41:13,567 --> 00:41:24,255 what we want to do is we want to push up the probability of an action from a state, if the action was better than the expected value of what we should get from that state. 402 00:41:24,255 --> 00:41:30,163 So, thinking about the value of what we're going to expect from the state, what does this remind you of? 403 00:41:30,163 --> 00:41:34,939 Does this remind you of anything that we talked about earlier in this lecture? 404 00:41:37,023 --> 00:41:41,297 Yes. [inaudible from audience] Yeah, so the value functions, right? 405 00:41:41,297 --> 00:41:47,871 The value functions that we talked about with Q-learning. So, exactly. So Q-functions and value functions, 406 00:41:47,871 --> 00:41:58,173 and so, the intuition is that, well, we're happy with an action, taking an action in a state s, if 407 00:41:58,173 --> 00:42:04,752 our Q-value of taking a specific action from this state is larger than 408 00:42:04,752 --> 00:42:09,698 the value function, or expected value of the cumulative future reward, that we can get from this state. 409 00:42:09,698 --> 00:42:14,416 Right, so this means that this action was better than other actions that we could've taken.
410 00:42:14,416 --> 00:42:22,063 And on the contrary, we're unhappy if this action, if this value or this difference, is negative or small. 411 00:42:23,917 --> 00:42:34,868 Right, so now if we plug this in as our scaling factor of how much we want to push up or down our probabilities of our actions, then we can get this estimator here. 412 00:42:34,868 --> 00:42:40,168 Right, so, it's going to be exactly the same as before, but now where 413 00:42:40,168 --> 00:42:50,514 before we had our cumulative expected reward, with our variance reduction techniques and baselines in, here we can just plug in 414 00:42:50,514 --> 00:43:00,530 this difference of how much better our current action was, based on our Q-function minus our value function from that state. 415 00:43:01,771 --> 00:43:09,413 Right, but with what we've talked about so far with our REINFORCE algorithm, we don't know what Q and V actually are. 416 00:43:09,413 --> 00:43:14,479 So can we learn these? And the answer is yes, using Q-learning, 417 00:43:14,479 --> 00:43:16,465 what we've already talked about before. 418 00:43:16,465 --> 00:43:22,210 So we can combine policy gradients, which we've just been talking about, with Q-learning, 419 00:43:22,210 --> 00:43:28,784 by training both an actor, which is the policy, as well as a critic, right, a Q-function, 420 00:43:28,784 --> 00:43:34,380 which is going to tell us how good we think a state is, and an action in a state. 421 00:43:34,380 --> 00:43:40,633 Right, so using this approach, an actor is going to decide which action to take, 422 00:43:40,633 --> 00:43:47,708 and then the critic, or Q-function, is going to tell the actor how good its action was and how it should adjust. 423 00:43:47,708 --> 00:43:53,636 And so, this also alleviates a little bit the task of this critic, compared to the Q-learning problems 424 00:43:53,636 --> 00:43:59,958 that we talked about earlier, of having to learn a Q-value for every state, action pair, 425 00:43:59,958 --> 00:44:04,762 because here it only has to learn this for the state-action pairs that are generated by the policy. 426 00:44:04,762 --> 00:44:10,512 It only needs to know this where it matters for computing this scaling factor. 427 00:44:10,512 --> 00:44:18,188 Right, and then we can also, as we're learning this, incorporate all of the Q-learning tricks that we saw earlier, such as experience replay. 428 00:44:18,188 --> 00:44:24,610 And so, now I'm also going to just define this term that we saw earlier, 429 00:44:24,610 --> 00:44:38,172 Q of s, a, how good was an action in a given state, minus V of s, our expected value of how good the state is: we call this term the advantage function. 430 00:44:38,172 --> 00:44:43,568 Right, so the advantage function is how much advantage did we get from playing this action? 431 00:44:43,568 --> 00:44:48,100 How much better the action was than expected. 432 00:44:48,100 --> 00:44:53,457 So, using this, we can put together our full actor-critic algorithm. 433 00:44:53,457 --> 00:44:56,279 And so what this looks like, is that we're going to start 434 00:44:56,279 --> 00:45:03,689 off by initializing our policy parameters theta, and our critic parameters that we'll call phi. 435 00:45:03,689 --> 00:45:11,306 And then for each, for iterations of training, we're going to sample M trajectories, under the current policy.
436 00:45:12,185 --> 00:45:18,725 Right, we're going to play our policy and get these trajectories as s-zero, a-zero, r-zero, s-one and so on. 437 00:45:18,725 --> 00:45:21,671 Okay, and then we're going to compute the gradients that we want. 438 00:45:21,671 --> 00:45:33,465 Right, so for each of these trajectories, and at each time step, we're going to compute this advantage function, and then we're going to use this advantage function, right? 439 00:45:33,465 --> 00:45:42,894 And then we're going to use that in our gradient estimator that we showed earlier, and accumulate our gradient estimate that we have here. 440 00:45:42,894 --> 00:45:50,837 And then we're also going to train our critic parameters phi in exactly the same way, 441 00:45:50,837 --> 00:45:57,557 so as we saw earlier, basically trying to learn our value function, right, 442 00:45:57,557 --> 00:46:10,347 which is going to boil down to just minimizing this advantage function, and this will encourage it to be closer to this Bellman equation that we saw earlier, right? 443 00:46:10,347 --> 00:46:19,361 And so, this is basically just iterating between learning and optimizing our policy function, as well as our critic function. 444 00:46:20,211 --> 00:46:26,727 And so then we're going to update the gradients and then we're going to go through and just continuously repeat this process. 445 00:46:29,271 --> 00:46:31,827 Okay, so now let's look at some examples of REINFORCE 446 00:46:31,827 --> 00:46:39,464 in action, and let's look first here at something called the Recurrent Attention Model, 447 00:46:39,464 --> 00:46:49,146 which is a model also referred to as hard attention, but which you'll see a lot recently in computer vision tasks for various purposes. 448 00:46:49,146 --> 00:46:59,167 Right, and so the idea behind this is, here I'm talking about the original work on hard attention, which is on image classification, and your goal is 449 00:46:59,167 --> 00:47:06,494 to still predict the image class, but now you're going to do this by taking a sequence of glimpses around the image. 450 00:47:06,494 --> 00:47:10,300 You're going to look at local regions around the image, 451 00:47:10,300 --> 00:47:17,141 and you're basically going to selectively focus on these parts and build up information as you're looking around. 452 00:47:17,141 --> 00:47:24,551 Right, and so the reason that we want to do this is, well, first of all it has some nice inspiration from human perception and eye movement. 453 00:47:24,551 --> 00:47:29,225 Let's say we're looking at a complex image and we want to determine what's in the image. 454 00:47:29,225 --> 00:47:39,168 Well, you know, we might maybe look at a low-resolution version of it first, and then look specifically at parts of the image that will give us clues about what's in this image. 455 00:47:39,168 --> 00:47:49,276 And then, this approach of just looking around at an image at local regions is also going to help you save computational resources, right? 456 00:47:50,435 --> 00:47:53,293 You don't need to process the full image. 457 00:47:53,293 --> 00:48:03,576 In practice, what usually happens is you look at a low-resolution version of the full image first, to decide how to get started, and then you look at high-res portions of the image after that.
458 00:48:04,671 --> 00:48:15,177 And so this saves a lot of computational resources, and you can think about the benefits of this for scalability, right, being able to, let's say, process larger images more efficiently. 459 00:48:16,164 --> 00:48:20,099 And then, finally, this could also help with actual classification performance, 460 00:48:20,099 --> 00:48:24,760 because now you're able to ignore clutter and irrelevant parts of the image. 461 00:48:24,760 --> 00:48:29,931 Right? Like, you know, instead of always putting all parts of your image through your ConvNet, 462 00:48:29,931 --> 00:48:36,353 you can use this to first prune down to the relevant parts that you actually want to process with your ConvNet. 463 00:48:37,237 --> 00:48:41,531 Okay, so what's the reinforcement learning formulation of this problem? 464 00:48:41,531 --> 00:48:51,117 Well, our state is going to be the glimpses that we've seen so far, right? What's the information that we've seen? 465 00:48:51,117 --> 00:48:55,228 Our action is then going to be where to look next in the image. 466 00:48:55,228 --> 00:49:02,842 Right, so in practice this can be something like the x, y-coordinates of the center of some fixed-size glimpse that you want to look at next. 467 00:49:02,842 --> 00:49:12,423 And then the reward for the classification problem is going to be one, at the final time step, if our image is correctly classified, and zero otherwise. 468 00:49:14,495 --> 00:49:24,550 And so, because taking these glimpses around the image is a non-differentiable operation, this is why we need to use a reinforcement learning formulation 469 00:49:25,761 --> 00:49:31,792 and learn policies for how to take these glimpse actions, and we can train this using REINFORCE. 470 00:49:31,792 --> 00:49:40,891 So, given the glimpses seen so far, the core of our model is going to be an RNN that we're going to use to model the state, 471 00:49:40,891 --> 00:49:47,418 and then we're going to use our policy parameters in order to output the next action. 472 00:49:49,354 --> 00:49:54,571 Okay, so what this model looks like is: we're going to take an input image, 473 00:49:54,571 --> 00:50:03,184 right, and then we're going to take a glimpse of this image. So here, this glimpse is the red box, and it's all blank, zeros. 474 00:50:03,184 --> 00:50:12,193 And so we'll pass what we've seen so far into some neural network, and this can be any kind of network depending on your task. 475 00:50:12,193 --> 00:50:18,758 In the original experiments that I'm showing here, on MNIST, this is very simple, so you can just use a couple of small, fully-connected layers, 476 00:50:18,758 --> 00:50:26,105 but you can imagine that for more complex images and other tasks you may want to use fancier ConvNets. 477 00:50:26,105 --> 00:50:28,775 Right, so you've passed this into some neural network, 478 00:50:28,775 --> 00:50:36,115 and then, remember, I said we're also going to be integrating our state of glimpses seen so far using a recurrent network. 479 00:50:36,115 --> 00:50:40,265 So we'll see that in a moment, but this is going to go through that recurrent network, 480 00:50:40,265 --> 00:50:46,094 and then it's going to output the x, y-coordinates of where I'm going to look next.
481 00:50:46,094 --> 00:50:50,766 And in practice, we want to output a distribution over actions, 482 00:50:50,766 --> 00:50:57,282 right, and so this is going to be a Gaussian distribution, and we're going to output its mean. 483 00:50:57,282 --> 00:51:03,944 In practice you can also output both a mean and a variance for this distribution, or the variance can be fixed. 484 00:51:03,944 --> 00:51:11,854 Okay, so now we're going to take this action: we're going to sample a specific x, y location from our action distribution, 485 00:51:11,854 --> 00:51:17,777 and then we're going to use this to extract the next glimpse from our image. 486 00:51:17,777 --> 00:51:23,385 Right, so here we've moved to the end of the two, this tail part of the two. 487 00:51:23,385 --> 00:51:26,745 And so now we're actually starting to get some signal of what we want to see, right? 488 00:51:26,745 --> 00:51:32,724 What we want to do is look at the relevant parts of the image that are useful for classification. 489 00:51:32,724 --> 00:51:37,104 So we pass this through, again, our neural network layers, and then also through 490 00:51:38,153 --> 00:51:47,343 our recurrent network, right, which takes this input as well as the previous hidden state, and this is representing our policy, 491 00:51:47,343 --> 00:51:54,095 and then we're going to use this to output our distribution for the next location that we want to glimpse at. 492 00:51:54,095 --> 00:51:57,303 So we can continue doing this, and you can see in this next glimpse here, 493 00:51:57,303 --> 00:51:59,903 we've moved a little bit more toward the center of the two. 494 00:51:59,903 --> 00:52:14,543 Alright, so it's probably learning that, you know, once I've seen this tail part of the two that looks like this, maybe moving in the upper left-hand direction will get me more towards the center, which will also have valuable information. 495 00:52:14,543 --> 00:52:17,473 And then we can keep doing this. 496 00:52:17,473 --> 00:52:26,795 And then finally we get to our last time step, so we can have a fixed number of time steps here, in practice something like six or eight. 497 00:52:26,795 --> 00:52:33,350 And at the final time step, since we want to do classification, we'll have our standard 498 00:52:33,350 --> 00:52:39,363 softmax layer that will produce a probability distribution over the classes. 499 00:52:39,363 --> 00:52:44,108 And here the max class was a two, so we can predict that this was a two. 500 00:52:44,108 --> 00:52:50,428 Right, and so this is going to be the set-up of our model and our policy, and then we have our 501 00:52:50,428 --> 00:52:59,569 estimate for the gradient of this policy, which, as we said earlier, we can compute by sampling trajectories and using those to do backprop. 502 00:52:59,569 --> 00:53:08,698 And so we can just do this in order to train this model and learn the parameters of our policy, right? All of the weights that you can see here. 503 00:53:09,953 --> 00:53:15,257 Okay, so here's an example of policies trained on MNIST, 504 00:53:16,710 --> 00:53:27,685 and you can see that, in general, from wherever it starts, it usually learns to move closer to where the digit is, and then to look at the relevant parts of the digit, right?
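As a rough sketch of the glimpse loop just described, here is what the core of such a model might look like in PyTorch. Everything here is an illustrative assumption rather than the original paper's implementation: the extract_glimpse cropping helper, the fixed glimpse size, the small fully-connected glimpse encoder, the GRU core, the Gaussian location policy with a fixed standard deviation, and the final softmax classifier.

import torch
import torch.nn as nn

class RecurrentAttention(nn.Module):
    # Illustrative hard-attention sketch: glimpse -> RNN state -> next (x, y) location -> class.
    def __init__(self, glimpse_size=8, hidden=128, n_classes=10, loc_std=0.1):
        super().__init__()
        self.glimpse_size = glimpse_size
        self.loc_std = loc_std                                   # fixed std for the location Gaussian
        self.glimpse_net = nn.Sequential(                        # small fully-connected encoder (MNIST-scale)
            nn.Linear(glimpse_size * glimpse_size + 2, hidden), nn.ReLU())
        self.rnn = nn.GRUCell(hidden, hidden)                    # integrates the glimpses seen so far (the state)
        self.loc_head = nn.Linear(hidden, 2)                     # mean of the Gaussian over the next (x, y)
        self.cls_head = nn.Linear(hidden, n_classes)             # classifier used at the final time step

    def forward(self, image, n_glimpses=6):
        B = image.size(0)
        h = image.new_zeros(B, self.rnn.hidden_size)
        loc = image.new_zeros(B, 2)                              # start at the image center
        log_probs = []
        for _ in range(n_glimpses):
            patch = extract_glimpse(image, loc, self.glimpse_size)   # assumed crop helper, returns [B, g, g]
            g = self.glimpse_net(torch.cat([patch.flatten(1), loc], dim=1))
            h = self.rnn(g, h)
            mean = torch.tanh(self.loc_head(h))                  # policy: where to look next, in [-1, 1] coords
            dist = torch.distributions.Normal(mean, self.loc_std)
            loc = dist.sample()
            log_probs.append(dist.log_prob(loc).sum(dim=1))      # kept for the REINFORCE gradient
        return self.cls_head(h), torch.stack(log_probs, dim=1)

# Training sketch: reward is 1 at the final step if the prediction is correct, 0 otherwise,
# and the REINFORCE term scales the glimpse log-probs by that (detached) reward:
#   logits, log_probs = model(images)
#   reward = (logits.argmax(dim=1) == labels).float()
#   loss = nn.functional.cross_entropy(logits, labels) \
#          - (reward.detach().unsqueeze(1) * log_probs).sum(dim=1).mean()

In practice a learned baseline would typically be subtracted from the reward to reduce variance, in the same spirit as the baselines discussed earlier in the lecture.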
So this is pretty cool, and 505 00:53:27,685 --> 00:53:30,460 this, you know, follows kind of what you would expect, right, 506 00:53:30,460 --> 00:53:38,400 if you were to choose places to look next in order to most efficiently determine what digit this is. 507 00:53:40,108 --> 00:53:49,758 Right, and so this idea of hard attention, of recurrent attention models, has also been used in a lot of tasks in computer vision in the last 508 00:53:49,758 --> 00:53:54,790 couple of years, so you'll see this used, for example, in fine-grained image recognition. 509 00:53:54,790 --> 00:54:01,763 So, I mentioned earlier that one of the useful benefits of this can be to 510 00:54:02,975 --> 00:54:10,180 both save on computation as well as ignore clutter and irrelevant parts of the image, and when you have fine-grained 511 00:54:10,180 --> 00:54:13,092 image classification problems, you usually want both of these. 512 00:54:13,092 --> 00:54:25,777 You want to keep high resolution, so that you can look at, you know, important differences, and then you also want to focus on these differences and ignore irrelevant parts. 513 00:54:25,777 --> 00:54:27,359 Yeah, question. 514 00:54:27,359 --> 00:54:31,526 [inaudible question from audience] 515 00:54:35,061 --> 00:54:41,482 Okay, so yeah, the question is how there is computational efficiency, given that we also have this recurrent neural network in place. 516 00:54:41,482 --> 00:54:47,761 So that's true, it depends on exactly what your problem is, what your network is, and so on, 517 00:54:47,761 --> 00:54:52,512 but you can imagine that if you had some really high-resolution image 518 00:54:52,512 --> 00:55:01,900 and you don't want to process the entire image with some huge ConvNet or some huge, you know, network, now you can get some savings by just 519 00:55:01,900 --> 00:55:04,669 focusing on specific smaller parts of the image. 520 00:55:04,669 --> 00:55:06,589 You only process those parts of the image. 521 00:55:06,589 --> 00:55:10,924 But you're right that it depends on exactly what problem set-up you have. 522 00:55:12,210 --> 00:55:14,530 This has also been used in image captioning, 523 00:55:14,530 --> 00:55:17,138 so if we're going to produce a caption for an image, 524 00:55:17,138 --> 00:55:23,197 we can have the model use this attention approach to generate the caption, 525 00:55:23,197 --> 00:55:28,999 and what it usually ends up learning is these policies where it'll focus on specific parts of the image 526 00:55:28,999 --> 00:55:38,341 in sequence, and as it focuses on each part, it'll generate the words or the part of the caption referring to that part of the image. 527 00:55:38,341 --> 00:55:42,948 And then it's also been used in tasks such as visual question answering, 528 00:55:42,948 --> 00:55:51,786 where we ask a question about the image and we want the model to output some answer to the question, for example, I don't know, 529 00:55:51,786 --> 00:55:53,840 how many chairs are around the table? 530 00:55:53,840 --> 00:56:03,040 And so you can see how this attention mechanism might be a good type of model for learning how to answer these questions. 531 00:56:05,707 --> 00:56:10,564 Okay, so that was an example of policy gradients used in these hard attention models. 532 00:56:10,564 --> 00:56:16,430 And now I'm going to talk about one more example that also uses policy gradients, 533 00:56:16,430 --> 00:56:18,465 which is learning how to play Go.
534 00:56:18,465 --> 00:56:24,497 Right, so DeepMind has this agent for playing Go, called AlphaGo, 535 00:56:24,497 --> 00:56:30,708 that's been in the news a lot over the past year and this year. 536 00:56:30,708 --> 00:56:31,541 So, sorry? 537 00:56:31,541 --> 00:56:32,374 [inaudible comment from audience] 538 00:56:32,374 --> 00:56:39,258 And yesterday, yes, that's correct. So this is very exciting, recent news as well. 539 00:56:39,258 --> 00:56:57,886 So last year, a first version of AlphaGo was put into a competition against one of the best Go players of recent years, Lee Sedol, and the agent was able to beat him four to one in a five-game match. 540 00:56:57,886 --> 00:57:09,348 And actually, right now, there's another match with Ke Jie, who is the current world number one, and it's best of three, in China, right now. 541 00:57:09,348 --> 00:57:13,436 And so the first game was yesterday. AlphaGo won. 542 00:57:13,436 --> 00:57:20,657 I think it was by just half a point, and so there are two more games to watch. These are all live-streamed, so 543 00:57:20,657 --> 00:57:28,193 you guys should also go online and watch these games. It's pretty interesting to hear the commentary. 544 00:57:29,225 --> 00:57:32,577 But, so what is this AlphaGo agent from DeepMind? 545 00:57:32,577 --> 00:57:36,466 It's based on a lot of what we've talked about so far in this lecture. 546 00:57:36,466 --> 00:57:42,045 What it is is a mix of supervised learning and reinforcement learning, 547 00:57:42,045 --> 00:57:51,656 as well as a mix of an older method for Go, Monte Carlo Tree Search, with recent deep RL approaches. 548 00:57:52,579 --> 00:57:56,986 So, okay, so how does AlphaGo beat the Go world champion? 549 00:57:56,986 --> 00:58:08,739 Well, to train AlphaGo, what it takes as input is a featurization of the board. So it's basically, right, your board and the positions of the stones on the board. 550 00:58:08,739 --> 00:58:10,739 That's your natural state representation. 551 00:58:10,739 --> 00:58:17,958 And what they do in order to improve performance a little bit is featurize this into some 552 00:58:17,958 --> 00:58:23,790 more channels: one set is the different stone colors, so this is kind of like the configuration of your board, 553 00:58:23,790 --> 00:58:31,138 also some channels for, for example, which moves are legal, some bias channels, various things, 554 00:58:31,138 --> 00:58:36,518 and then, given this state, right, it's going to first train a network 555 00:58:36,518 --> 00:58:40,897 that's initialized with supervised training from professional Go games. 556 00:58:40,897 --> 00:58:48,495 So, given the current board configuration, or a featurization of it, what's the correct next action to take? 557 00:58:48,495 --> 00:58:55,608 Alright, so given examples of professional games played, you know, just collected over time, 558 00:58:55,608 --> 00:59:02,605 we can take all of these professional Go moves and train a standard supervised mapping from board state to the action to take. 559 00:59:02,605 --> 00:59:05,365 Alright, so they take this, which is a pretty good start, 560 00:59:05,365 --> 00:59:09,227 and then they're going to use this to initialize a policy network. 561 00:59:09,227 --> 00:59:17,778 Right, so the policy network is just going to have the exact same structure: the input is your board state and the output is the action that you're going to take.
562 00:59:17,778 --> 00:59:21,978 And this was the set-up for the policy gradients that we just saw, right? 563 00:59:21,978 --> 00:59:27,130 So now we're going to just continue training this using policy gradients. 564 00:59:27,130 --> 00:59:35,123 And it's going to do this reinforcement learning training by playing against random previous iterations of itself. 565 00:59:35,123 --> 00:59:42,243 So self-play, and the reward it's going to get is one if it wins and negative one if it loses. 566 00:59:42,243 --> 00:59:47,573 And we're also going to learn a value network, so, something like a critic. 567 00:59:47,573 --> 01:00:01,475 And then the final AlphaGo is going to combine all of these together, so policy and value networks along with a Monte Carlo Tree Search algorithm, in order to select actions by lookahead search. 568 01:00:01,475 --> 01:00:19,891 Right, so after putting all this together, the value of a node, of where you are in play and what you do next, is going to be a combination of your value function as well as the rollout outcomes that you compute from standard Monte Carlo Tree Search rollouts. 569 01:00:19,891 --> 01:00:27,453 Okay, so, yeah, these are basically the components of AlphaGo. 570 01:00:28,397 --> 01:00:33,814 If you're interested in reading more about this, there's a Nature paper about it from 2016, 571 01:00:34,664 --> 01:00:47,953 and they trained this over, I think, the version of AlphaGo that's being used in these matches uses something like a couple thousand CPUs plus a couple hundred GPUs, putting all of this together, 572 01:00:47,953 --> 01:00:52,120 so it's a huge amount of training that's going on, right. 573 01:00:55,659 --> 01:01:01,529 And yeah, so you guys should follow the games this week. It's pretty exciting. 574 01:01:03,491 --> 01:01:10,858 Okay, so in summary, today we've talked about policy gradients, which are very general. 575 01:01:10,858 --> 01:01:18,855 You're just directly doing gradient ascent on your policy parameters, 576 01:01:18,855 --> 01:01:23,947 so this works well for a large class of problems, but it also suffers from high variance, 577 01:01:23,947 --> 01:01:28,669 so it requires a lot of samples, and the challenge here is sample efficiency. 578 01:01:28,669 --> 01:01:35,349 We also talked about Q-learning, which doesn't always work; it's sometimes harder to get it to work 579 01:01:35,349 --> 01:01:46,730 because of this problem that we talked about earlier, where you're trying to estimate this exact state-action value over very high-dimensional spaces, but when it does work, 580 01:01:47,769 --> 01:01:54,322 for problems like, for example, the Atari games we saw earlier, then it's usually more sample-efficient than policy gradients. 581 01:01:54,322 --> 01:01:59,902 Right, and one of the challenges in Q-learning is making sure that you're doing sufficient exploration. 582 01:01:59,902 --> 01:02:04,902 Yeah? [inaudible question from audience] 583 01:02:14,313 --> 01:02:21,753 Oh, so the question is, for Q-learning, can you do this process where you're trying to start it off with some supervised training?
584 01:02:21,753 --> 01:02:24,924 So, I guess the direct approach for Q-learning doesn't 585 01:02:24,924 --> 01:02:27,532 do that, because you're trying to regress to these 586 01:02:27,532 --> 01:02:46,021 Q-values, right, instead of taking policy gradients over this distribution, but I think there are ways in which you can massage this type of thing to also bootstrap, because I think bootstrapping in general, or behavior cloning, is a good way to warm-start these policies. 587 01:02:47,454 --> 01:02:54,213 Okay, so, right, we've talked about policy gradients and Q-learning, so here's just another look at 588 01:02:54,213 --> 01:02:56,752 some of the guarantees that you have with policy gradients. 589 01:02:56,752 --> 01:03:02,789 One thing we do know that's really nice is that this will always converge to a local optimum of J of theta, 590 01:03:04,339 --> 01:03:12,131 because we're just directly doing gradient ascent, and this local optimum is often pretty good, right. 591 01:03:12,131 --> 01:03:17,041 In Q-learning, on the other hand, we don't have any guarantees, because here we're trying to approximate 592 01:03:17,041 --> 01:03:23,358 the Bellman equation with a complicated function approximator, and so, in this case, this is why 593 01:03:23,358 --> 01:03:29,954 Q-learning can be a little bit trickier to train, in terms of applicability to a wide range of problems. 594 01:03:31,849 --> 01:03:41,546 Alright, so today you got a very brief, kind of high-level overview of reinforcement learning and some major classes of algorithms in RL. 595 01:03:41,546 --> 01:03:47,577 And next time we're going to have a guest lecturer, Song Han, who's done a lot 596 01:03:47,577 --> 01:03:52,569 of pioneering work in model compression and energy-efficient deep learning, 597 01:03:52,569 --> 01:03:56,459 and so he's going to talk about some of this. 598 01:03:56,459 --> 01:03:58,459 Thank you.